

Introduction 


Advances in the use of both distributed computing systems and parallel computers
have led to the consideration of a distributed-parallel computing system, which is
a distributed system of computing nodes in which some of the nodes may be parallel
computers. Such systems can combine attributes of both kinds of system, such as high
performance, redundancy and reliability, and physical proximity of parts of the
computation to the data.

One use of a distributed-parallel system is the enhancement of an ordinary distributed
system to one in which one or more of the nodes may be high performance computers.
Current projections of high performance computing architectures are that these
machines will necessarily be parallel computers. Examples of applications that will
require this sort of distributed-parallel system are those in which large numbers of
small computing nodes service requests or collect data and then send those
requests or data collections to a large computing node for processing. Although
these applications add the issue of heterogeneity to the distributed computing model,
this issue has been addressed by a number of systems without otherwise changing the
underlying distributed processing model.

The other use of a distributed-parallel system is the enhancement of a single parallel
machine to the parallelism achievable by many (possibly parallel) nodes in a distributed
system. Essentially, the idea is to scale up the parallelism found in a single machine,
where the communication times are relatively fast over a bus or network, to a parallel
system where the communication times are relatively slow over a local area network
(LAN) or wide area network (WAN). In addition to enhanced computing power, this
use of a distributed system could add distributed system capabilities such as reliability
and redundancy to a high performance system. This use of a distributed processing
system may change the underlying processing model, since using the distributed
processes as parallel processes implies a tighter coupling of the data and process
interactions, as is found in parallel processing models. More investigation is needed in
this area to understand the parameters that make this kind of distributed-parallel
computing feasible.


Scope  of  work 

The goal of this project is to demonstrate the feasibility of distributed-parallel
computing in the second case (as defined above) by implementing a parallel application
on a distributed system of parallel computers and designing evaluation criteria. The
project is designed as a sequence of four tasks as described in the GIT contract:

1.  Establish  a  working  distributed-parallel  platform  between  the  Northeast  Parallel 
Architectures  Center  (NPAC),  located  in  Syracuse,  New  York,  and  Rome  Laboratory, 
located in Rome, New York, using the Cronus Distributed Operating System.*


2. Design and specify a distributed-parallel application which can exploit the
capabilities of a diverse set of parallel architectures where the entire computation must
be coordinated over a set of individual computational nodes.

3. Perform analysis to determine appropriate criteria to evaluate the performance of
a distributed-parallel system.

4. Based upon recommendations from Task 3, specify a candidate demonstration
by which the capabilities of the system developed in Task 1 can be shown.

The  remainder  of  this  report  discusses  the  implementation  and  the  results  of  each  of 
these  tasks. 


The  Distributed-Parallel  Computing  Platform  at  Rome  and  Syracuse,  NY 

The establishment of the distributed-parallel computing platform made use of available
hardware and software at Rome Laboratory in Rome, NY, and NPAC at Syracuse
University in Syracuse, NY. Although the design and evaluation phases of this project
were careful to plan for a more general case in which the parallel application may be
implemented on a heterogeneous distributed system, the implementation itself was
carried out on a homogeneous system consisting of two Encore Multimaxes**.

Both  of  these  machines  were  shared  memory  MIMD  Encore  Multimax  320's.  The  one  at 
Rome  has  16  parallel  processors  and  the  one  at  Syracuse  20  parallel  processors. 

Both of these Multimaxes are on local area networks at their respective sites with Sun
workstations (and other hardware not used in this project) and a gateway to the WAN
run by NYSERNet. Between the gateways of Rome Laboratory and Syracuse University,
a distance of about 50 miles, is a T1 line with no other intermediate gateways (see
Figure 1).

In addition to these machines, the hardware for this project included two
Hewlett-Packard 4972A Local Area Network Protocol Analyzers (LANPAs). These were
provided by Rome Laboratory. One was installed in the Rome Laboratory LAN and the
other in the Syracuse University LAN. These devices are able to monitor the actual
message packets of any communications between the two Multimaxes as the packets go
out and come in through the gateways.

* Cronus is a product of BBN Systems and Technologies Corporation

** Multimax is a trademark of Encore Computer Corporation

The distributed operating system chosen for this platform was the Cronus distributed
programming environment from BBN. While this system is widely available as a
heterogeneous distributed system, this project was one of the first to make use of a new
extension that enables distributed processes to also take advantage of running on a
parallel machine like the Multimax. In fact, this was one of the uncertainties of the project.
Originally, it was planned to run Cronus on the Umax operating system, and this
platform was established at Rome on the Multimax 320 and at Syracuse on the Multimax 520
(which differs essentially in having faster processors). However, the parallel computing
support was only available in Cronus under the Mach operating system, and the
platform was changed accordingly. Rome Laboratory acquired Mach and installed it with
Cronus on their Multimax 320, and NPAC installed Cronus on the Mach already
running on their Multimax 320.*

Another part of the software platform was the code written to collect message timings
from the LANPAs. The code to analyze packets on the LANPAs was written by the
team from Rome Laboratory, and the code to synchronize clocks between the LANPAs
and the Multimaxes was written by a systems administrator from NPAC and Rome
Laboratory.


[Figure 1: diagram with panels labeled Rome Laboratory and Syracuse University]


The  Experimental  Application 

The first step in choosing an experimental application was to consider the
characteristics of an application that could take advantage of this kind of
distributed-parallel platform. While considering these characteristics, both the team
from Rome Laboratory and the team from NPAC studied the distributed and parallel
process models. Rome Laboratory provided training on the Cronus distributed system,
and NPAC provided training on parallel processes in Mach.


* We are grateful for the support given both by Encore Corporation in providing Mach and by BBN in
providing the parallel process implementation in Cronus.
Desired Characteristics


The  characteristics  which  were  found  to  be  appropriate  were  those  concerning  the 
computation  to  communication  ratio  and  those  of  system-wide  synchronization. 

A suitable application must have sufficient granularity of parallelism to take advantage
of the combined total of 36 processors. It must not have so large an amount of
communication that the advantage of using two parallel computers is lost. Furthermore, it
must not have communication in a synchronization pattern such that the advantage of
using two parallel computers is again lost. Finally, we decided against using an
application with essentially no communication, what is called an "embarrassingly parallel"
application, where the parallel subprocesses are totally independent of each other. This
kind of application would indeed take advantage of two distributed parallel processors,
but we preferred to concentrate on the more challenging case of an application where
the parallel subprocesses must have some degree of communication between them.

This  would  lead  to  an  experiment  that  would  show  whether  an  even  larger  class  of 
parallel  applications  than  the  "embarrassingly  parallel"  ones  would  be  feasible  on  this 
system. 


Multisource  Tracking  Simulation  -  Overview 

The application chosen was a multisource missile tracking program originally written at
Caltech in conjunction with JPL for the Hypercube machine [Gottschalk 1987]. This
program is publicly available as part of the Caltech benchmarking suite [Messina et al.
1990]. The tracking program receives a set of missile coordinates at each time step and
establishes a set of tracks based on calculated trajectories. The tracking program is
intended to be part of an application in a distributed system of sensors observing the
missile data, which is communicated to a large processing node running the tracking
program. Our experiment implements the large processing node as two parallel
machines, and the sensor data is generated and controlled by a subroutine.

The structure of the program is a large loop based on time; the loop body is executed
once for each set of sensor data (see Figure 2). The main data structure of the program
is the current set of tracks being determined. The loop body uses the sensor
data to perform, in sequence, the following steps:

1) extend the "current" tracks using a rough filter (which adds all possible extensions to
the tracks file),

2) delete bad extensions using a precision filter, and

3) initiate any new tracks.

Most  of  the  computational  time  is  spent  in  the  precision  filter;  all  of  these  computations 
can  be  parallelized  over  the  tracks. 

We had several versions of the program to work with: 1) a parallel version developed
for a distributed memory MIMD machine, 2) a sequential version derived from the first,
and 3) a parallel version adapted from the second for a shared memory MIMD machine.
The original code had been parallelized by dividing up the current set of tracks among
all the processors. The precision filter code is not completely independent in each track,
but depends only on other nearby tracks (primarily to detect false duplicate tracks).
During each loop body, tracks are created and deleted. Tracks must be moved between
the processors to: 1) colocate possible duplicate tracks and 2) ensure that large
differences in the number of tracks on each processor do not occur. Thus a load balancing
phase, in which tracks are more evenly distributed among the processors, is added
into the main loop body just prior to the precision filter computation.


Outline of program (once around for each set of missile data received):

    find track extensions using rough filter
    redistribute tracks among processors
    delete bad extensions using precision filter
    initiate new tracks

Data structure: (local) tracks file

For each processor:

    main
    {   initializations;
        The Big Loop:
        {   get_data;
            master;
            balance;
            nl_filt;
            batch;
        }
    }

Figure 2: Application Structure


This application fits our criteria for a distributed-parallel system. There is plenty of
parallelism, since the number of tracks will be in the hundreds for even fairly small
problems. The amount of computation is large with respect to the amount of
communication. Although the distributed processors must synchronize during the
communication of load balancing, this only happens once during each loop body.

Programming  Model 

An important aspect of this project was to examine several different computational
models: sequential processing, parallel processing, distributed processing and
distributed-parallel processing. In order to facilitate this, Cronus, a distributed computing
environment developed by Bolt, Beranek and Newman Inc. (BBN), was chosen as our
development environment. Cronus utilizes an object-oriented client/server
programming model. Cronus' strengths lie in its support for a number of different platforms
(heterogeneity) and its support for fault-tolerance.
In Cronus, objects are passive and are "managed" by a multi-threaded manager (server).
As such, each manager represents a shared address space for a set of objects of a
common type. This maps well to either a uniprocessor or a shared-memory multiprocessor.
New threads of control are automatically generated to handle an operation invocation
(request) on the server. In addition, new threads may be explicitly created.

If threads of control were ultra-lightweight and a server had a very large buffer
for pending operations, a very general programming model would result. This model is one
in which the managed objects are tracks and a simulation control task (associated with
each manager) generates asynchronous operation requests for each track in the local
tracks "database". The result, on different architectures, would be sequential processing
on a uniprocessor, and N-way multiprocessing on an N-node shared memory
multiprocessor.

As  will  be  noted  in  detail  later,  experimentation  determined  that  no  thread  is  infinitely 
"lightweight"  and  there  are  real  limits  to  the  buffering  capacity  of  a  server.  Thus,  the 
application  was  coded  to  uniformly  distribute  tracks  to  N  explicitly  created  threads  on 
an  N-node  processor. 

Partitioning 

One manager, the Tracks Manager, was implemented in C and instantiated once on each
distributed node (although we implemented this code on two nodes, the code was
written more generally). Cronus manages machine dependencies and permits the code
to be portable to a number of different platforms. Within each manager, the code is
parallelized over the tracks assigned to that node (see Fig. 3). The major programming
tasks were to rewrite the load balancing to use Cronus message-passing (by invoking a
method in another manager), and to implement the shared-memory parallelism
within each distributed node.

In implementing the parallelism within each distributed node, we used an extension of
Cronus in which the distributed process model was extended to a parallel process
model. In the distributed process model, multiple invocations of operations (methods) in a
manager were shared memory concurrent processes. However, each of these processes was
implemented in a coroutine fashion on a single processor. Under this implementation,
there were no synchronization problems with using shared memory, since each process
was guaranteed to run to completion or until it yielded the processor.

The extension to allow truly parallel execution of the processes within one manager has
processes request to run on their own processor. Hence, multiple invocations of operations
can run on many processors. However, if more than one procedure may now modify some
data in the shared memory, synchronization must be used to ensure data integrity.
Semaphores are available for this synchronization.
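As an illustration of the synchronization requirement just described, the following is a minimal Python sketch, not the Cronus or Mach interface. The class `TrackFile`, its fields, and `run_workers` are hypothetical names introduced here; a binary semaphore guards the read-modify-write that allocates a track identifier in a track file shared by several threads.

```python
import threading


class TrackFile:
    """Hypothetical shared track file; names are illustrative only."""

    def __init__(self):
        self.tracks = {}
        self.next_id = 0
        # Binary semaphore protecting next_id and tracks.
        self._sem = threading.Semaphore(1)

    def add_track(self, state):
        # Without the semaphore, two threads could read the same next_id
        # and overwrite each other's track.
        with self._sem:
            track_id = self.next_id
            self.next_id = track_id + 1
            self.tracks[track_id] = state
        return track_id


def run_workers(n_workers, per_worker):
    """Start n_workers threads that each add per_worker tracks."""
    track_file = TrackFile()

    def worker(worker_id):
        for i in range(per_worker):
            track_file.add_track((worker_id, i))

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return track_file
```

In the coroutine-style model described earlier, the semaphore would be unnecessary, because each operation runs until it completes or yields; it becomes necessary only once invocations can truly run at the same time.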
Manager 1:

    master;
        for all tracks
            Extendtrack(trackid);

    where

    Extendtrack(x)
    {   Getownprocessor;
        ...
        Releaseownprocessor;
    }

    (shared) trackfile: tr1, tr2, tr3, ...
    processors

Figure 3: Form of parallelism in each manager


The general decomposition of "parallel" operations is demonstrated by example in Fig. 3.
The original code would sequentially execute an operation, e.g., extendtrack, for each
track (in the manager's track file). The parallel code creates an independent task (thread)
for each operation, where the operation code explicitly asks for independent scheduling
via the TaskObtainOwnProcessor() call.

The Tracks Manager exports two distinct interfaces (sets of operations). The first
interface is used by the sensor data generator. There are three operations available in this
interface: NewSimulation, SimInit, and AcceptData. NewSimulation is used to
determine the availability of a Tracks Manager (only one such manager can exist on a host,
and it was coded so that only one simulation could be active at a time). SimInit is used
to pass general simulation data (initialization) to a manager, e.g. the number and location of
sensors. Finally, AcceptData is invoked for each new sensor scan to pass the sensor data
to the Tracker.

The second interface is the one which enables managers from multiple hosts to cooperate on a
single simulation. These are the operations used to implement the (load) balancing
phase. This interface has two operations: AcceptTrackList and AddTrack.
AcceptTrackList is used to transfer a summary of a manager's local list of tracks to other
managers. This information is needed so that each manager can autonomously
determine which tracks must be transferred and to which manager. AddTrack is the operation
which transfers a block of tracks from one manager to the other (see Fig. 4).
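The autonomous decision described above can be sketched as follows. This is a hedged illustration rather than the project's algorithm: it assumes the exchanged summary is just a track count per manager, and it balances counts only (it ignores the colocation of possible duplicate tracks). The function name `plan_transfers` and the manager names are hypothetical. Because the plan is a deterministic function of the same summaries, every manager computes the same plan without further coordination.

```python
def plan_transfers(track_counts):
    """Return a list of (source, destination, n_tracks) moves that evens out
    the number of tracks per manager.

    track_counts maps a manager name to its number of local tracks.
    """
    names = sorted(track_counts)  # fixed order so all managers agree
    total = sum(track_counts.values())
    base, extra = divmod(total, len(names))
    # The first `extra` managers in name order hold one additional track.
    target = {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}

    surplus = [[n, track_counts[n] - target[n]] for n in names
               if track_counts[n] > target[n]]
    deficit = [[n, target[n] - track_counts[n]] for n in names
               if track_counts[n] < target[n]]

    transfers = []
    si = di = 0
    while si < len(surplus) and di < len(deficit):
        n_moved = min(surplus[si][1], deficit[di][1])
        transfers.append((surplus[si][0], deficit[di][0], n_moved))
        surplus[si][1] -= n_moved
        deficit[di][1] -= n_moved
        if surplus[si][1] == 0:
            si += 1
        if deficit[di][1] == 0:
            di += 1
    return transfers
```

For two managers holding 300 and 100 tracks, the plan is a single block transfer of 100 tracks, which each sender would then carry out with one invocation on the receiving manager.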


Distributed-Parallel  Software  Lessons 

It was not a main goal of our project to evaluate distributed-parallel software
environments in general, or Cronus in particular. However, since we did use Cronus for
this project, a discussion of the problems and successes in using this kind of software
may be helpful to software designers.
Manager 1:

    main
    {   initializations;
        The Big Loop:
        {   get_data;
            master;
            balance:
                call AddTrack(trackid);
            nl_filt;
            batch;
        }
    }

Manager 2:

    main
    {   initializations;
        The Big Loop:
        {   get_data;
            master;
            balance;
            nl_filt;
            batch;
        }
    }

    AddTrack(x) { };

The message "trackid" is passed by
invoking a method in the other manager.

Figure 4: Message passing between managers


The advantages of using a high-level software system like Cronus, instead of a low-level
system like Mach or Express, are great. We felt that the time spent porting code was
very small due to the tools for managing a group of distributed processes and the
simple message-passing model. It was also easy to design for a more general system
than we actually had; it was easy to structure the code for arbitrary numbers of nodes,
and heterogeneity is built in. We also felt that the extension of Cronus to include
parallel processing nodes was successful on the shared memory parallel processors.

Of course, we must expect to pay some overhead for using a high-level heterogeneous
software system, but there were a couple of areas where we felt that the overheads were
unacceptably large. Where possible, we modified the software design to better
reflect the needs of our analysis. In other words, where possible we did not want our
results to simply reflect constraints imposed by our design environment.

Our first design of the code was a standard Cronus, persistent object-oriented
design, in which tracks were objects within the managers. However, Cronus keeps these
in a persistent database, which is appropriate for applications in which reliability and
redundancy are most important. In our application, with rapidly changing data as
tracks are added and removed, the overhead of persistent objects was prohibitive, and
we made them into an ordinary volatile data structure.

Our initial pass at implementing parallel threads was to take advantage of the implicit
task creation which occurs during an object invocation on a manager. In this scenario, a
manager would have a coordinating task (executing the main loop), which would issue
asynchronous invocations upon the manager for each "track". Under this design, if a
manager could not allocate multiple processors for incoming invocations (as in the case of a
uniprocessor), it would just queue up the pending invocations and serve them
sequentially. In fact, the TaskObtainOwnProcessor() call handles this situation in such a way
that the manager code could be identical on both a uniprocessor and a multiprocessor.

Two separate performance issues required this solution to be redesigned. First,
operation invocation (and the subsequent task creation) is a relatively expensive activity (as
implemented in Cronus on the Mach OS on the Encore Multimax). Thus some
operations, which were independent over tracks, could not be effectively parallelized since
the invocation time dominated the actual processing time. Second, in any client-server
(message-passing) environment, message queues are implemented as finite buffers. We
found that we could easily overrun the message queue buffer. Messages were lost if we
did not explicitly program in a wait after a certain number of messages were sent. This is
a common problem in message-passing systems and one which high-level software
paradigms should address.
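The buffer overrun and the explicit wait that avoided it can be mimicked with a toy model. The sketch below is a single-threaded simulation under stated assumptions, not Cronus behavior: `FiniteMailbox`, `send_all`, and the `window` parameter are hypothetical names, a full mailbox silently drops messages, and draining the mailbox stands in for the sender waiting until the server has caught up.

```python
from collections import deque


class FiniteMailbox:
    """Toy server queue that drops messages once `capacity` are pending."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def deliver(self, msg):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # overrun: the message is lost
            return False
        self.queue.append(msg)
        return True

    def drain(self):
        """Server handles every pending message."""
        handled = list(self.queue)
        self.queue.clear()
        return handled


def send_all(mailbox, messages, window=None):
    """Send messages; if `window` is set, pause after every `window` sends."""
    handled = []
    for count, msg in enumerate(messages, start=1):
        mailbox.deliver(msg)
        if window is not None and count % window == 0:
            # Stand-in for the explicit wait: let the server catch up.
            handled.extend(mailbox.drain())
    handled.extend(mailbox.drain())
    return handled
```

With a capacity of 8 and no window, only the first 8 of 100 messages survive; with a window equal to the capacity, all 100 are handled in order. The cost of this scheme is that the sender must know, or guess, the receiver's buffer size.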

In the final implementation, a single invocation was made to a parallel code block. The
code then checks the number of available processors, creates up to that
number of parallel tasks, and correspondingly partitions the track file among the tasks
(data parallelism). The resulting code was still portable between uniprocessors and
multiprocessors. In addition, the partitioning code could amortize the cost of task
creation over the amount of computation each task would perform. The result was
significantly better parallel performance.
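The structure of this final design can be sketched in Python as below. This is an assumption-laden illustration, not the project's C code: `partition`, `filter_block`, and `parallel_filter` are hypothetical names, the per-track work is a placeholder predicate, and in CPython threads show only the structure, since they do not give processor-level parallelism for pure Python code.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into at most n_parts contiguous, nearly equal blocks."""
    n_parts = max(1, min(n_parts, len(items)))
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def filter_block(tracks):
    """Placeholder for the per-track work done by one task."""
    return [t for t in tracks if t % 7 != 0]


def parallel_filter(tracks, n_workers=None):
    """Create one task per available processor, each handling one block."""
    n_workers = n_workers or os.cpu_count() or 1
    blocks = partition(tracks, n_workers)
    # One task per block: task creation cost is paid len(blocks) times,
    # not once per track.
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        results = list(pool.map(filter_block, blocks))
    return [t for block in results for t in block]
```

With `n_workers=1` the same function degenerates to sequential processing, which mirrors the portability between uniprocessors and multiprocessors noted above.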

Performance  Evaluation  Methodology 

The most important aspect of performance evaluation of a distributed-parallel system is
its system-wide performance, as opposed to single parallel processor performance or
distributed operating system performance. The unit of measure is essentially the
elapsed time of the implementation, which can be compared with the other modes of
computing.

System-wide comparisons which can be made for any distributed-parallel application
are:

•  Sequential  —  running  the  application  on  one  node  with  one  processor. 

•  Parallel  —  running  the  application  on  one  node  with  many  processors. 

•  Distributed-parallel  —  running  the  application  on  more  than  one  node  where 
each  node  may  run  many  processors. 

•  Distributed  —  running  the  application  on  more  than  one  node  where  each  node 
only  runs  one  processor. 

In addition, if feasible, one can compare the effects of nodes on local area networks
versus wide area networks. In our case, our local area network machines were SPARC
workstations, so that our comparisons had to factor in relative CPU speeds. We also had
available more than one version of the code so that we could compare the software
overhead of distributed or parallel computing.

Figure 5: Analysis Objectives

Analysis of system-wide performance can be bolstered by analyses of individual
performance components. These can be particularly important if one is looking for areas in
which to improve the overall system-wide performance.

One important area is the communication vs. computation ratio. We found it
particularly useful to analyze our application in terms of a communication/computation
profile. For this, we subdivided our application into components which would have
different communication/computation ratios based upon the algorithms used, or which
had different degrees of parallelism. Again, the unit of measure in comparing different
parts of the profile was elapsed time. We found this profile was useful both in tuning
our code for performance and in extrapolating how the performance of the application
would scale in terms of size. This analysis supersedes the more traditional
measurements of parallel scaling.
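A communication/computation profile of this kind can be tabulated with a few lines of code. The sketch below is illustrative only: the function name `profile` and the input layout are assumptions, and any timings passed to it in an example are made up rather than measured. Each component's elapsed time is split into computation and communication, and the profile reports that component's share of total elapsed time and its communication fraction.

```python
def profile(timings):
    """Build a communication/computation profile.

    timings maps a component name to a pair
    (computation_seconds, communication_seconds).
    """
    total = sum(comp + comm for comp, comm in timings.values())
    rows = {}
    for name, (comp, comm) in timings.items():
        elapsed = comp + comm
        rows[name] = {
            "elapsed": elapsed,
            # Share of the whole run spent in this component.
            "share": elapsed / total if total else 0.0,
            # Fraction of this component's time spent communicating.
            "comm_fraction": comm / elapsed if elapsed else 0.0,
        }
    return rows
```

For example, calling `profile` with the four loop routines `master`, `balance`, `nl_filt`, and `batch` as components shows at a glance which routine dominates the run and which one is communication bound.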

Another important area is load balancing. This can be measured as the percent of
elapsed time that any node has to wait for other nodes at a synchronization point.
Finally, some measurements may need to be taken of the communication traffic on the
wide area network. Although interference from other users of the processors or local
area networks can quite likely be controlled since they are locally "owned", interference
over a wide area network is quite likely not controllable. Thus, it may be important to
establish the performance of the communication under a variety of network loadings.
Hardware devices like the HP LANPAs can be used to collect time stamps of the
packets.
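The load-balancing measure mentioned above, the fraction of elapsed time a node spends waiting at a synchronization point, can be computed as in this minimal sketch. The function name `wait_fractions`, the input layout, and the assumption that all nodes are released when the last one arrives are introduced here for illustration.

```python
def wait_fractions(arrival_times, elapsed):
    """Fraction of elapsed time each node waits at one synchronization point.

    arrival_times maps a node name to the time (seconds from the start of
    the iteration) at which it reached the synchronization point; all nodes
    are assumed to be released when the last one arrives.
    """
    release = max(arrival_times.values())
    return {node: (release - t) / elapsed for node, t in arrival_times.items()}
```

With synchronized clocks on the nodes, the arrival times can come from time stamps recorded by the application itself.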

Experimental  Results 

As the final phase of our project, the application was run on the distributed-parallel
platform established between Rome Laboratory and Syracuse University. The clocks
were synchronized between the two Multimaxes, the application code was initiated on
the two Multimax nodes by Cronus, and time stamps were recorded by the application
and by the two LANPAs.

Although data analysis was not specified as a task of the project, some preliminary
analysis was done of system-wide performance based on the elapsed times of the
communication/computation profile. In fact, based on the times of the initial demonstration,
some performance tuning was done on the code. We successfully shortened the
communication phase by sending groups of tracks together, i.e. we switched to fewer,
longer messages. For some parts of the program which had minimal
amounts of parallelism, we also adjusted whether they ran on parallel processors or ran
sequentially.
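Why fewer, longer messages shorten the communication phase can be seen from a simple cost model, sketched here under stated assumptions. The nominal T1 rate of 1.544 Mbit/s is standard; the fixed per-message overhead, the message sizes, and the function name `transfer_time` are hypothetical and are not measurements from this project.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def transfer_time(n_messages, bytes_per_message, per_message_overhead_s,
                  bits_per_second=T1_BITS_PER_SECOND):
    """Modeled time to send n_messages of equal size over one link.

    Each message pays a fixed overhead (invocation, protocol, latency)
    plus the time to transmit its payload.
    """
    payload_bits = n_messages * bytes_per_message * 8
    return n_messages * per_message_overhead_s + payload_bits / bits_per_second
```

In this model the payload term is unchanged by batching, so sending 200 tracks of 100 bytes as one message instead of 200 messages saves 199 times the per-message overhead.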

The communication/computation profile that we devised for our application was very
simply divided in terms of the main subroutines. Both the master routine and the batch
routine have some sequential parts and some minimally parallel parts, by which we
mean that the procedures that can be run in parallel are so short as to barely make it
worthwhile to incur the overhead of parallel invocation. The balance routine has all the
message-passing of the whole program; its times are overwhelmingly communication.
Finally, the "nl_filt" routine has the main computation of the precision filter; it has
substantial parallelism in terms of the amount of computation per parallel invocation.
All the possible parallel routines have a grain size much larger than the total number of
processors of the two machines.

We tested our programs with a number of data sets which varied in the number of
missiles to be tracked, the timing sequence in which the missiles were launched, etc.
However, relative timings did not vary significantly over these data sets, so we chose
one data set to report on here. A profile of this data set over 100 iterations of the program
is shown in Figure 6. In this data set, the bulk of the new missile tracks occurs between
iterations 10 and 25, yielding the peak of the tracks file, which contains all potential
tracks. As tracks are filtered, the number of tracks settles down by iteration 65 to the
correct number of missiles, around 680. After that time, tracks are dropped by the
program as they enter the "post-boost phase", which this program does not process.

The first set of timing tests that we ran compared sequential versus parallel execution. In
fact, we varied the number of processors on one parallel machine from 1 to 16 and
measured traditional parallel speedups. An example of these results is shown in Figure
7. In this graph, the running time of all procedures is totalled at each iteration step. The
biggest speedup occurs between one processor and two processors, where the total time
is almost halved. Good speedups continue as processors are added up to five processors,
after which no further speedup occurs. Surprisingly, this application only has useful
parallelism for up to five processors.
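
The speedup measurement described above can be sketched as follows. This is a minimal
illustration and not the code used in the project: the function names are hypothetical, and
the timing values are invented solely to mimic a curve that flattens at about five processors.

```python
def speedups(times_by_procs):
    """Map processor count -> speedup relative to the one-processor time."""
    base = times_by_procs[1]
    return {p: base / t for p, t in sorted(times_by_procs.items())}


def saturation_point(times_by_procs, min_gain=0.05):
    """Return the largest processor count that still reduced the running
    time of the previous count by at least min_gain (a fraction)."""
    counts = sorted(times_by_procs)
    best = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        gain = (times_by_procs[prev] - times_by_procs[cur]) / times_by_procs[prev]
        if gain < min_gain:
            break
        best = cur
    return best


# Hypothetical per-iteration totals in seconds (NOT measured data).
example_times = {1: 10.0, 2: 5.2, 3: 3.8, 4: 3.1, 5: 2.7, 6: 2.7, 8: 2.7, 16: 2.8}
```

With these invented values, saturation_point(example_times) identifies five processors as the
last count that gives a worthwhile reduction in running time.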

Since we were surprised at this result, we did extensive timings on the individual
procedures of the application. In particular, we found that this limit on parallelism
was also true of the nl_filt procedure, which is the main computational module and
should be parallel in the number of tracks. In Figure 8, we show the cumulative running
time of this module; that is, at each iteration we add that iteration's time to a
running total. This again clearly shows the limitation of five processors, which
points to the overhead imposed by process invocation in this software
model.
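
The cumulative running time used in these graphs is simply a running total of the
per-iteration times. A minimal sketch, with a hypothetical function name and invented
timing values:

```python
from itertools import accumulate


def cumulative_times(per_iteration_times):
    """Running total: entry i is the sum of the times for iterations 0..i."""
    return list(accumulate(per_iteration_times))


# Hypothetical per-iteration times in seconds for one module (NOT measured data).
example = [0.5, 1.5, 2.0, 1.0]
totals = cumulative_times(example)
```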

We proceeded to the timings for the main goal of our project, which was to compare all
the processing modes: sequential, distributed, parallel, and distributed-parallel. In
Figure 9, these results are shown on a graph of cumulative running times for all
procedures up to 100 iterations. We limited the parallel case to six processors and the
distributed-parallel case (marked DP in the figure) to six processors on each of the two
distributed machines, in view of our previous result on the parallel overhead limitation.
Interestingly, the distributed-parallel case still shows substantial improvement over the
parallel case, although it is far from being twice as fast. We noted that in the distributed
cases (both distributed-parallel versus parallel and distributed versus sequential), the
tracks file is divided in half, and many of the task's operations, such as comparing a track
to all other local tracks, were also halved in time. Offsetting this gain, of course, was the
time spent in dividing the tracks and in communicating them to the other machine.
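
The trade-off described above can be illustrated with a very simple model. This is only a
sketch under stated assumptions: it supposes that the computation divides evenly between the
machines and that the time for dividing and communicating the tracks is pure overhead. The
function names and the numbers are hypothetical and are not taken from the measurements.

```python
def distributed_time(compute_time, split_time, comm_time, machines=2):
    """Estimated time when the work is divided evenly across machines.

    Assumes the computation scales down linearly with the number of
    machines, and that splitting and communication are pure overhead.
    """
    return compute_time / machines + split_time + comm_time


def distributed_speedup(compute_time, split_time, comm_time, machines=2):
    """Ratio of single-machine time to the estimated distributed time."""
    return compute_time / distributed_time(compute_time, split_time, comm_time, machines)


# Hypothetical values in seconds (NOT measured data).
s = distributed_speedup(compute_time=100.0, split_time=5.0, comm_time=15.0)
```

Under this model the speedup reaches two only when the overhead terms are zero; any time
spent dividing or communicating the tracks keeps it below that.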

Finally, we would like to show some results regarding the communication/computation
profile. For this, it is instructive to view several different problem sizes. In Figures
10, 11, and 12, we show cumulative elapsed times for the different modules in different
computing modes for problem sizes of 130, 385, and 680 targets, respectively. The
cumulative elapsed times are shown on a non-uniform scale so that the communication/
computation ratios are shown for each problem size. From these figures, we see that
the percentage of time spent in the communication phase, contained in the balance
routines, tends to decrease as the problem size gets larger. Although this collection is too
small to infer that the decrease continues, it gives hope for successful distributed
implementations of very large problems of this type.
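
The percentage of time spent in communication can be computed from the per-module elapsed
times as sketched below. The function name, the module labels other than nl_filt, and all
timing values are hypothetical; "balance" merely stands in for the balance routines
mentioned above.

```python
def communication_fraction(module_times, comm_modules=("balance",)):
    """Fraction of total elapsed time spent in the communication modules."""
    total = sum(module_times.values())
    comm = sum(t for name, t in module_times.items() if name in comm_modules)
    return comm / total


# Hypothetical elapsed times in seconds per module (NOT measured data).
# "balance" stands in for the balance routines, "nl_filt" is the filter
# module named in the text, and "other" lumps together everything else.
small = {"balance": 30.0, "nl_filt": 50.0, "other": 20.0}
large = {"balance": 40.0, "nl_filt": 300.0, "other": 60.0}
```

With these invented values the communication share falls from 30 percent on the small
problem to 10 percent on the large one, which is the kind of trend the figures suggest.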

Conclusions 

This project demonstrates the feasibility of distributed-parallel computing for
an interesting class of applications. No precise performance judgements can be given
based on such a limited investigation, but much promise is shown for the future. All of
the components of this system (the Encore Multimax processors, the T1 line, and the
software) are known to be slower than the components that will be available in the
near future. This project gives no reason to believe that the increased performance
shown on this platform cannot scale to future, faster systems.

Figure 7:

Parallel Speedup: Total running time of all procedures per iteration

Figure 9:

Mode Profile: Cumulative time for all procedures shown for each processing mode.

Figure 10:

Elapsed time for program modules on a small sized problem

Figure 12:

Elapsed time for program modules on a large sized problem